Reviews: A Tensorized Transformer for Language Modeling

Neural Information Processing Systems

The code failed to compile and had numerous confusing aspects, and the authors did not link to the actual code used to train the model. Related work exists (April 2019), but I could find no comparison with it. I would also like to see the total FLOP usage compared to the baseline, as FLOPs are frequently the limiting factor in training and deploying models.


Reviews: A Tensorized Transformer for Language Modeling

Neural Information Processing Systems

The reviewers agree that the proposed model is well motivated and that the reduction in parameters achieved is significant. As such, this paper is worthy of publication. However, the reviewers also note a number of issues with the clarity of the presentation, general grammatical errors, and errors in the accompanying code. All of these issues must be addressed before publication. The authors are also required to add a more complete evaluation across a range of parameter scales for the tensorized model and the baseline, and to report the total FLOPs used.



A Tensorized Transformer for Language Modeling

Ma, Xindian, Zhang, Peng, Zhang, Shuai, Duan, Nan, Hou, Yuexian, Zhou, Ming, Song, Dawei

Neural Information Processing Systems

Recent developments in neural models have connected the encoder and decoder through a self-attention mechanism. In particular, the Transformer, which is based solely on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, multi-head attention, a key component of the Transformer, limits the effective deployment of the model in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely, Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103, and One-Billion Word) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention can not only largely compress the model parameters but also achieve performance improvements, compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.
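To make the compression argument concrete, here is a minimal sketch (not the authors' implementation) of the two building blocks the abstract describes: a Tucker-style contraction of a small core tensor with shared query/key/value factors, and a parameter count comparing a block-term parameterization against standard multi-head attention projections. The rank and model sizes below are hypothetical illustration values, not the paper's settings.

```python
import numpy as np

def tucker_contract(core, Q, K, V):
    """Mode-wise (Tucker) contraction of a small core with shared factors.

    core: (r, r, r) block core; Q, K, V: (n, r) shared factor matrices.
    Returns the 3rd-order tensor
        T[i, j, m] = sum_{p,q,r} core[p,q,r] * Q[i,p] * K[j,q] * V[m,r],
    the basic multi-linear interaction used in place of per-head bilinear maps.
    """
    return np.einsum('pqr,ip,jq,mr->ijm', core, Q, K, V)

def btd_param_count(num_blocks, d_model, rank):
    """Parameters for block-term attention: one shared (d_model x rank)
    projection per factor, plus one (rank, rank, rank) core per block."""
    shared = 3 * d_model * rank
    cores = num_blocks * rank ** 3
    return shared + cores

def multihead_param_count(num_heads, d_model):
    """Parameters for standard multi-head attention projections:
    separate Q/K/V projections per head, head dim = d_model // num_heads."""
    return 3 * num_heads * d_model * (d_model // num_heads)

# Hypothetical sizes: sharing factors across blocks shrinks the projection
# parameters from 786,432 to 57,344 in this configuration.
d_model, heads, rank = 512, 8, 16
print(multihead_param_count(heads, d_model))   # 786432
print(btd_param_count(heads, d_model, rank))   # 57344
```

The point of the sketch is the sharing pattern: the factor matrices are reused by every block, so each additional "head" costs only a small core rather than a full set of projections.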